Abstract: The Defense Advanced Research Projects Agency
(DARPA) Learning Applied to Ground Vehicles (LAGR) program
aims to develop algorithms for autonomous vehicle navigation that
learn how to operate in complex terrain. For the LAGR program, the
National Institute of Standards and Technology (NIST) has
embedded  learning  into  a  control  system  architecture  called  4D/RCS 
to  enable  the  small  robot  used  in  the  program  to  learn  to  navigate 
through  a  range  of  terrain  types.  This  paper  describes  performance 
evaluation  experiments  on  one  of  the  algorithms  developed  under  the 
program  to  learn  terrain  traversability.  The  algorithm  uses  color  and 
texture  to  build  models  describing  regions  of  terrain  seen  by  the 
vehicle’s  stereo  cameras.  Range  measurements  from  stereo  are  used 
to  assign  traversability  measures  to  the  regions.  The  assumption  is 
made  that  regions  that  look  alike  have  similar  traversability.  Thus, 
regions  that  match  one  of  the  models  inherit  the  traversability  stored 
in  the  model.  This  allows  all  areas  of  images  seen  by  the  vehicle  to 
be  classified,  and  enables  a  path  planner  to  determine  a  traversable 
path  to  the  goal. 

The  algorithm  is  evaluated  by  comparison  with  ground  truth 
generated  by  a  human  observer.  A  graphical  user  interface  (GUI)  was 
developed  that  displays  an  image  and  randomly  generates  a  point  to 
be  classified.  The  human  assigns  a  traversability  label  to  the  point, 
and  the  learning  algorithm  associates  its  own  label  with  the  point. 
When  a  large  number  of  such  points  have  been  labeled  across  a 
sequence  of  images,  the  performance  of  the  learning  algorithm  is 
determined  in  terms  of  error  rates.  The  learning  algorithm  is  outlined 
in  the  paper,  and  results  of  performance  evaluation  are  described. 
I.  Introduction

The Defense Advanced Research Projects Agency (DARPA)
Learning Applied to Ground Vehicles (LAGR) program [1]
aims to develop algorithms for autonomous vehicle navigation
that learn how to operate in complex terrain. Over many years,
the National Institute of Standards and Technology (NIST) has
developed a reference model control system architecture
called 4D/RCS that has been applied to many kinds of robot
control, including autonomous vehicle control [2]. For the
LAGR program, NIST has embedded learning into a 4D/RCS
controller to enable the small robot used in the program to
learn to navigate through a range of terrain types [3]. The
vehicle learns in several ways. These include learning by
example, learning by experience, and learning how to
optimize traversal. In this paper, we present a method of
evaluating a learning algorithm used in LAGR that associates
terrain appearance with traversability. The paper briefly
describes the learning method and then focuses on the
evaluation procedure. The approach is illustrated with
examples taken from tests run by the LAGR evaluation team.

The appearance of regions in an image has been described in
many ways, but most frequently in terms of color and/or
texture. Ulrich and Nourbakhsh [4] used color imagery to
learn the appearance of a set of locations to enable a robot to
recognize where it is. A set of images was recorded at each
location and served as descriptors for that location. Images
were represented by a set of one-dimensional histograms in
both HLS (hue, luminance, saturation) and normalized Red,
Green, and Blue (RGB) color spaces. When the robot needed
to recognize its location, it compared its current image with
the set of images associated with locations. The location was
recognized as that associated with the best-matching stored
image.

In [5] the authors also addressed the issue of
appearance-based obstacle detection using a single color
camera and no range information. Their approach makes the
assumptions that the ground is flat and that the region directly
in front of the robot is ground. This region is characterized by
color histograms and used as a model for ground. In the
domain of road detection, a related approach is described in
[6]. In principle, the method could be extended to deal with
more classes, and our algorithm can be seen as one such
extension that does not need to make the assumptions because
of the availability of range information for regions close to the
vehicle.

Learning has been applied to computer vision for a variety
of applications, including traversability prediction. Wellington
and Stentz [7] predicted the load-bearing surface under
vegetation by extracting features from range data and
associating them with the actual surface height measured
when the vehicle drove over the corresponding terrain. The
system  learned  a  mapping  from  terrain  features  to  surface 
height  using  a  technique  called  locally  weighted  regression. 
Learning  was  done  in  a  map  domain.  We  also  use  a  map  in  the 
current  work,  although  it  is  a  two  dimensional  (2D)  rather 
than  a  three  dimensional  (3D)  map,  and  we  also  make  use  of 
the  information  gained  when  driving  over  terrain  to  update 
traversability  estimates,  although  not  as  the  primary  source  of 
traversability  information.  The  models  we  construct  are  not 
based  on  range  information,  however,  since  this  would 
prevent  the  extrapolation  of  the  traversability  prediction  to 
regions  where  range  is  not  available. 

Howard  et  al.  [8]  presented  a  learning  approach  to 
determining  terrain  traversability  based  on  fuzzy  logic.  A 
human  expert  was  used  to  train  a  fuzzy  terrain  classifier  based 
on  terrain  roughness  and  slope  measures  computed  from 
stereo  imagery.  The  fuzzy  logic  approach  was  also  adopted  by 
Shirkhodaie  et  al.  [9],  who  applied  a  set  of  texture  measures  to 
windows  of  an  image  followed  by  a  fuzzy  classifier  and 
region  growing  to  locate  traversable  parts  of  the  image. 

Talukder  and  his  colleagues  [10]  describe  a  system  that 
attempts  to  classify  terrain  based  on  color  and  texture.  Terrain 
is  segmented  using  labels  generated  from  a  3D  obstacle 
detection  algorithm.  Each  segment  is  described  in  terms  of 
Gabor  texture  measures  and  color  distributions.  Based  on 
color  and  texture,  the  segments  are  assigned  to  pre-existing 
classes.  Each  class  is  associated  with  an  a  priori  traversability 
measure  represented  by  a  spring  with  known  spring  constant. 
We  also  make  use  of  3D  obstacle  detection  in  our  work,  but 
don’t  explicitly  segment  the  data  into  regions.  We  model  both 
background  and  obstacle  classes  using  color  and  texture,  but 
all  models  are  created  as  the  vehicle  senses  the  world.  Given 
that  we  have  no  prior  knowledge  of  the  type  of  terrain  that 
may  be  encountered,  it  is  usually  not  possible  to  pre-specify 
the  classes.  Similarly,  the  vehicle  learns  the  traversability  of 
the  terrain  by  interacting  with  it,  either  by  driving  over  it  or 
generating  a  bumper  hit. 

II.  The  Learning  Algorithm

The  learning  process  takes  input  in  the  form  of  labeled 
pixels  with  associated  (x,  y,  z)  positions.  The  labels  are 
provided  on  a  pixel-by-pixel  basis  by  an  obstacle  detection 
algorithm  that  works  on  stereo  data  [11].  Given  the  labels  and 
color  characteristics  of  the  pixels,  the  learning  algorithm 
constructs  color  and  texture  models  of  traversable  and 
non-traversable  regions  and  uses  them  for  terrain 
classification.  The  approach  to  model  building  is  to  make  use 
of  the  labeled  color  data  to  describe  regions  in  the 
environment  around  the  vehicle  and  to  associate  a  cost  of 
traversing  each  region  with  its  description.  The  terrain  models 
are  learned  using  an  unsupervised  scheme  that  makes  use  of 
both  geometric  and  appearance  information. 

In our algorithm an assumption is made that terrain regions
that look similar will have similar traversability. The learning
works as follows (see [12]). The system constructs a map of a
40 m by 40 m region of terrain surrounding the vehicle, with
map cells of size 0.2 m by 0.2 m and the vehicle in the center
of  the  map.  The  map  is  always  oriented  with  one  axis  pointing 
north  and  the  other  east.  The  map  scrolls  under  the  vehicle  as 
the  vehicle  moves,  and  cells  that  scroll  off  the  end  of  the  map 
are  forgotten.  Cells  that  move  onto  the  map  are  cleared  and 
made  ready  for  new  information. 

The  model-building  algorithm  takes  as  input  the  color  image, 
the  associated  and  registered  range  data  (x,  y,  z  points),  and 
the  labels  (GROUND  and  OBSTACLE)  generated  by  the 
obstacle  detection  algorithm.  Also  associated  with  these  data 
is  the  location  and  pose  of  the  vehicle  when  the  data  were 
collected.  When  new  data  are  received,  the  vehicle  location 
and  pose  information  are  used  to  scroll  the  map  so  that  the 
vehicle  occupies  the  center  cell  of  the  map. 
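
The map geometry described above can be sketched as follows. This is a minimal illustration, not the NIST implementation: the function name `world_to_cell`, the constant names, and the convention that columns increase to the east and rows to the north are assumptions made here. Only the 40 m map size, the 0.2 m cell size, and the vehicle-centered, north-aligned layout come from the text.

```python
import math

MAP_SIZE_M = 40.0    # side length of the map, from the text
CELL_SIZE_M = 0.2    # side length of one cell, from the text
CELLS_PER_SIDE = round(MAP_SIZE_M / CELL_SIZE_M)  # 200 cells per side


def world_to_cell(east, north, vehicle_east, vehicle_north):
    """Return (row, col) of the cell containing a world point, or None.

    Hypothetical convention: columns grow to the east, rows grow to the
    north, and the vehicle sits at the center of the map. Points that fall
    outside the 40 m by 40 m window are off the map and return None.
    """
    half = MAP_SIZE_M / 2.0
    col = math.floor((east - vehicle_east + half) / CELL_SIZE_M)
    row = math.floor((north - vehicle_north + half) / CELL_SIZE_M)
    if 0 <= row < CELLS_PER_SIDE and 0 <= col < CELLS_PER_SIDE:
        return row, col
    return None
```

Because the index is computed relative to the current vehicle position, scrolling falls out of the same arithmetic: a point seen earlier maps to a different cell, or to no cell at all, once the vehicle has moved.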

Points  are  projected  into  cells  based  on  their  3D  positions. 
Each  cell  receives  all  points  that  fall  within  the  square  region 
in  the  world  determined  by  the  location  of  the  cell,  regardless 
of  the  height  of  the  point  above  the  ground.  The  cell  to  which 
the  point  projects  accumulates  information  that  summarizes 
the  characteristics  of  all  points  seen  by  this  cell.  This  includes 
color,  texture,  and  contrast  properties  of  the  projected  points, 
as  well  as  the  number  of  OBSTACLE  and  GROUND  points 
that  have  projected  into  the  cell.  Color  is  represented  by  ratios 
R/G,  G/B,  and  intensity.  The  intensity  and  color  ratios  are 
represented  by  8-bin  histograms  stored  in  a  normalized  form 
so  that  they  can  be  viewed  as  probabilities  of  the  occurrence 
of  each  ratio.  Texture  and  contrast  are  computed  using  Local 
Binary  Patterns  (LBP)  [13].  These  patterns  represent  the 
relationships  between  pixels  in  a  3x3  neighborhood  in  the 
image,  and  their  values  range  from  0  to  255.  The  texture 
measure  is  represented  by  a  histogram  with  8  bins,  also 
normalized. Contrast is represented by a single number
ranging from 0 to 1.
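
As an illustration of the cell features, the sketch below computes a basic 3x3 Local Binary Pattern code and an 8-bin normalized histogram. The function names, the clockwise neighbor ordering, the `>=` comparison, and the equal-width binning are assumptions for this example; the text specifies only the 3x3 neighborhood, the 0 to 255 code range, and the 8 normalized bins.

```python
def lbp_code(patch):
    """Basic 8-neighbor Local Binary Pattern code (0..255) for a 3x3 patch.

    patch is a 3x3 list of gray values. Each neighbor that is at least as
    bright as the center pixel sets one bit of the code.
    """
    center = patch[1][1]
    # Neighbors visited clockwise, starting at the top-left pixel (assumed order).
    neighbors = [patch[0][0], patch[0][1], patch[0][2], patch[1][2],
                 patch[2][2], patch[2][1], patch[2][0], patch[1][0]]
    code = 0
    for bit, value in enumerate(neighbors):
        if value >= center:
            code |= 1 << bit
    return code


def normalized_histogram(values, lo, hi, bins=8):
    """Histogram of values clipped to [lo, hi], normalized to sum to 1."""
    counts = [0] * bins
    for v in values:
        v = min(max(v, lo), hi)
        index = min(int((v - lo) / (hi - lo) * bins), bins - 1)
        counts[index] += 1
    total = sum(counts)
    return [c / total for c in counts] if total else [0.0] * bins
```

Under these assumptions a texture histogram for a cell would be `normalized_histogram(codes, 0, 255)`, where `codes` holds the LBP codes of the points projected into that cell, and the same helper would serve for the intensity and color-ratio histograms with suitable ranges.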

When  a  cell  accumulates  enough  points  it  is  ready  to  be 
considered  as  a  model.  We  determine  the  sample  size  by 
requiring  95%  confidence  that  the  sample  represents  the  true 
distribution.  In  order  to  build  a  model,  we  also  require  that 
95%  of  the  points  projected  into  a  cell  have  the  same  label 
(OBSTACLE  or  GROUND).  If  a  cell  is  the  first  to 
accumulate  enough  points,  its  values  are  copied  to  instantiate 
the  first  model.  Models  have  exactly  the  same  structure  as 
cells,  so  this  is  trivial.  If  there  are  already  defined  models,  the 
cell  is  matched  to  the  existing  models  to  see  if  it  can  be 
merged  or  if  a  new  model  must  be  created.  Matching  is  done 
by  computing  a  weighted  sum  of  the  squared  difference  of  the 
elements  of  the  model  and  the  cell.  Cells  that  are  similar 
enough  are  merged  into  existing  models;  otherwise,  new 
models  are  constructed. 
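
The matching step can be sketched as a weighted sum of squared differences over the feature elements. The dictionary layout, the feature names, the weights, and the threshold below are hypothetical; the text does not give the values used on the vehicle.

```python
def match_score(cell, model, weights):
    """Weighted sum of squared differences over all feature elements.

    cell and model map a feature name to a list of numbers (for example a
    normalized histogram); weights maps the same names to a weight.
    Lower scores mean a closer match.
    """
    score = 0.0
    for name, weight in weights.items():
        for a, b in zip(cell[name], model[name]):
            score += weight * (a - b) ** 2
    return score


def find_best_model(cell, models, weights, threshold):
    """Return the index of the closest model, or None if none is close enough."""
    best_index, best_score = None, threshold
    for i, model in enumerate(models):
        s = match_score(cell, model, weights)
        if s <= best_score:
            best_index, best_score = i, s
    return best_index
```

A return value of `None` corresponds to the case where a new model must be created; otherwise the cell would be merged into the model at the returned index.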

At  this  stage,  there  is  a  set  of  models  representing  regions 
whose  appearance  in  the  color  images  is  distinct  (Fig  1).  Our 
interest  is  not  so  much  in  the  appearance  of  the  models,  but  in 
the  traversability  of  the  regions  associated  with  them. 
Traversability is computed from a count of the number of
GROUND and OBSTACLE points that have been projected
into each cell, and accumulated into the model. Models are
given traversability values computed as
$N_{obstacle} / (N_{ground} + N_{obstacle})$. These models
correspond to regions learned by example.

Learning by experience is used to modify the models. As the
vehicle travels, it moves from cell to cell in the map. If it is
able to traverse a cell that has an associated model, the
traversability of that model is increased. If it hits an obstacle
in a cell, the traversability is decreased.
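
A small sketch of the traversability bookkeeping follows. The value is computed exactly as stated in the text, so it is the fraction of OBSTACLE points and a lower value indicates more traversable terrain. The update rule is an assumption: the text does not say how much a model changes after a traversal or a bumper hit, so this example simply adds to the GROUND or OBSTACLE count. All names are hypothetical.

```python
def traversability_value(n_ground, n_obstacle):
    """Value from the text: N_obstacle / (N_ground + N_obstacle)."""
    total = n_ground + n_obstacle
    if total == 0:
        raise ValueError("no points accumulated")
    return n_obstacle / total


def has_dominant_label(n_ground, n_obstacle, fraction=0.95):
    """True if at least `fraction` of the points share one label."""
    total = n_ground + n_obstacle
    return total > 0 and max(n_ground, n_obstacle) / total >= fraction


def update_by_experience(model, traversed, weight=1):
    """Assumed update: a traversal counts as GROUND evidence, a hit as OBSTACLE."""
    key = "n_ground" if traversed else "n_obstacle"
    model[key] += weight
    return traversability_value(model["n_ground"], model["n_obstacle"])
```

With this choice, driving over a cell lowers the obstacle fraction of its model and a bumper hit raises it, which matches the direction of change described above.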


Fig 1. Examples of histograms used to construct models.
Top row corresponds to the blue regions in the left image.
Middle row corresponds to the green region. Bottom row
corresponds to the red region. The blue region is not
traversable, while the other two regions are traversable.

To classify a scene, only the color image is needed (no range
data). A window is passed over the image and color, texture,
and intensity histograms and a contrast value are computed as
in model building. A comparison is made with the set of
models, and the window is classified with the best matching
model, if a sufficiently good match value is found. Regions
that do not find good matches are left unclassified. Windows
that match with models inherit the traversability measure
associated with the model. In this way large portions of the
image are classified (Fig 2).

The vehicle needs to know the locations of obstacle and
ground regions, but has no stereo information during
classification. To address this problem, the assumption is
made that the ground is flat, i.e., that the pose of the vehicle
defines a ground plane through the wheels. This allows
windows that match with models to be mapped to 3D
locations. Another assumption is that all obstacles (windows
matching with models created from obstacle points) are
normal to the ground plane. This allows obstacle windows to
be projected into the ground plane and thus to acquire 3D
locations. Because of the ground plane assumption, the
algorithm only processes the image from in front of the
vehicle to a small distance above the horizon, to catch the
obstacles but ignore the sky.
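
The flat-ground mapping can be illustrated with a standard pinhole-camera calculation. This is only a sketch under simplifying assumptions that are not in the paper: the camera is level, its optical axis is parallel to the ground, and the focal length, principal point, and camera height are hypothetical parameters. The actual system uses the full vehicle pose.

```python
def pixel_to_ground(u, v, focal_px, cam_height_m, cx, cy):
    """Intersect the viewing ray of pixel (u, v) with a flat ground plane.

    Assumes a level pinhole camera: image rows grow downward, (cx, cy) is
    the principal point, and the horizon falls on row cy. Returns
    (forward, right) in meters in the camera frame, or None for pixels at
    or above the horizon, which never meet the ground.
    """
    dv = v - cy
    if dv <= 0:
        return None
    forward = focal_px * cam_height_m / dv
    right = (u - cx) * forward / focal_px
    return forward, right
```

For a window matched to an obstacle model, the same calculation would be applied to the row where the window meets the ground, since obstacles are assumed to stand normal to the ground plane.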


Fig 2. Top: Left and right eye views of a typical scene from
Test 9. Bottom: Classification showing regions that are
traversable in yellow, and not traversable in magenta.


III.  Evaluating  the  Algorithm 

The entire LAGR system was tested over the course of a
year by a separate Government team using a vehicle
functionally identical to the vehicles on which the software is
developed. Tests occurred about once a month. Developers
sent their control software on flash memory cards to the test
facility. The software was loaded onto a vehicle which was
commanded to travel from a start waypoint to a goal waypoint
through an obstacle-rich environment. The environment was
not seen in advance by the development teams. The
Government team measured the performance of the system on
multiple runs. To demonstrate learning, performance was
expected to improve from run to run as the systems became
familiar with the course. While these tests gave a good
indication of how learning improved the overall performance,
they did not provide evaluations of individual learning
algorithms.

Evaluating the algorithm described in this paper requires
determining how well the learned models enable the system to
classify the degree of traversability of the terrain around the
vehicle. The evaluation makes use of ground truth generated
by one or more human observers who use a graphical tool to
generate ground truth points against which the learning
algorithm is compared.

Data sets used for the evaluation consist of log files
generated during the tests conducted by the Government team.
Log files contain the sequence of images collected by the two
pairs of stereo cameras on the LAGR vehicle and information
from the other sensors, including the navigation (GPS and
INS) sensors and bumper sensors (physical and IR bumpers).
The NIST LAGR system performs exactly the same when
playing back a log file as it did when it first ran the course, so
long as no changes are made in the algorithms. Therefore,
logged data is a good source for performance testing.

The ground truth is collected by a human stepping
sequentially through the log file, and classifying one or more
points from each image. A graphical tool is used to display the
image and randomly select a point (Fig. 3). The point is
highlighted for the user, who selects one of the labels Ground
(G), Obstacle (O), or Unknown (U). The tool then writes a
record to a file containing the frame number, coordinates of
the selected point, and the label provided by the user. Note
that the Unknown label is used for points that are neither
ground nor obstacle (such as sky) as well as points where the
human truly cannot decide between ground and obstacle (such
as at the base of an obstacle that merges smoothly with the
ground). When ground truth collection is complete, the file is
available for evaluating the performance of the learning
algorithm (or any other algorithm that assigns traversability
labels to regions).
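
The record-keeping part of the ground truth tool could look like the sketch below. The comma-separated layout, the function names, and the uniform sampling over the whole image are assumptions for illustration; the text states only that a record holds the frame number, the point coordinates, and the label.

```python
import csv
import io
import random

LABELS = {"G": "Ground", "O": "Obstacle", "U": "Unknown"}


def pick_point(width, height, rng):
    """Choose a pixel uniformly at random inside a width x height image."""
    return rng.randrange(width), rng.randrange(height)


def write_record(writer, frame, point, label):
    """Append one ground truth record: frame number, x, y, label."""
    if label not in LABELS:
        raise ValueError(f"unknown label: {label!r}")
    writer.writerow([frame, point[0], point[1], label])


# Example: one record written to an in-memory file.
rng = random.Random(0)            # seeded so a session can be reproduced
buffer = io.StringIO()
writer = csv.writer(buffer)
write_record(writer, 12, pick_point(640, 480, rng), "G")
```

Keeping the records in a plain text file makes it easy to reuse the same points for other algorithms, as noted above.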


Fig.  3.  The  GUI  for  generating  ground  truth  showing  a 
frame  from  Test  7. 


The learning algorithm reads the ground truth file and the
log file. It processes the log file as it usually does when
running on the vehicle. Each time it comes to an image frame
for which ground truth is available, it classifies the points
selected in the frame and writes out a file containing the
ground truth it read in plus an entry giving the learned
classification of the pixel in the ground truth file. When the
entire log file has been processed, the output file contains an
entry for each ground truth point that gives both the human’s
classification and the system’s classification. Under the
assumption that the human’s classification is correct, an
analysis can be conducted of the errors committed by the
learning algorithm.
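
The error analysis can be sketched as follows, assuming each output entry has been reduced to a pair (human label, system label) using the letters G, O, and U. The function name and the returned dictionary keys are hypothetical. Points the human marked Unknown are dropped, following the description of usable points in the results.

```python
from collections import Counter


def evaluate(pairs):
    """Summarize agreement between human and system labels.

    pairs is an iterable of (human_label, system_label), each 'G', 'O',
    or 'U'. The human label is treated as correct, and points the human
    labeled 'U' are not usable and are skipped.
    """
    usable = [(h, s) for h, s in pairs if h in ("G", "O")]
    correct = 0
    errors = Counter()
    for human, system in usable:
        if system == human:
            correct += 1
        elif system == "U":
            errors["not_classified"] += 1
        elif system == "O":
            errors["obstacle_instead_of_ground"] += 1
        else:
            errors["ground_instead_of_obstacle"] += 1
    total = len(usable)
    incorrect = total - correct
    return {
        "total": total,
        "correct": correct,
        "incorrect": incorrect,
        "pct_correct": 100.0 * correct / total if total else 0.0,
        # Share of each error type among the incorrect points, in percent.
        "error_share": {k: 100.0 * n / incorrect for k, n in errors.items()},
    }
```

The three error categories correspond to the columns of the error distribution reported for each test.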


IV.  Results

The evaluation was applied to a number of examples taken
from data gathered by the LAGR evaluation team at locations
in Virginia and Texas. Results are shown for these examples
and an overall evaluation is given of the performance of the
algorithm across all the data sets.

In the evaluations, the learning system starts out with no
models. This is how the system typically starts, at least for the
first test run at each location. As it reads the log file and the
ground truth data, the learning program both creates the
models and classifies the ground truth points. This means that
early in the sequence of images, only a small number of
models are available for classification. As more of the terrain
is seen, more models are constructed, and the range of regions
that can be classified increases. The algorithm learns very fast,
however, often creating the first few models from the first
frame or two of data. Since the terrain doesn’t usually change
abruptly, classification performs well from the start,
particularly for points close to the vehicle.

Four  sets  of  ground  truth  data  were  created  by  three 
different  people  using  the  GUI  in  Fig.  3.  The  data  were  taken 
from  log  files  of  three  different  tests.  Test  6  was  conducted  in 
September,  2005  in  Fort  Belvoir,  VA.  Test  7  was  also 
conducted  at  Fort  Belvoir,  in  October,  2005.  The  course  was 
very  different,  however.  Test  9  was  conducted  in  San  Antonio, 
TX at the Southwest Research Institute’s Small Robot Testbed.


A.  Test  6 


Test  6  included  a  run  along  a  path  through  a  slightly  wooded 
area,  ending  in  an  open  field.  Two  synthetic  obstacles  made 
out  of  orange  plastic  mesh  were  placed  in  the  path  of  the 
vehicle  (Fig.  4),  with  the  goal  being  to  learn  that  the  first 
fence  represented  an  obstacle  and  use  that  knowledge  to  avoid 
the  second  fence. 


Fig.  4.  A  view  of  the  first  orange  fence  in  Test  6. 


The  ground  truth  created  for  Test  6  consisted  of 
approximately  3  points  per  frame,  using  the  log  file  of  the  first 
test  run.  Because  the  human  sometimes  labeled  a  point  as 
Unknown, and because some of the points randomly selected
for  ground  truth  were  in  the  sky,  the  actual  number  of  usable 
points  was  closer  to  2  per  frame  (there  were  1,270  frames). 

TABLE  I  shows  a  summary  of  the  results  of  the  evaluation. 
As  can  be  seen,  the  algorithm  labeled  87%  of  the  points  with 
the  same  class  as  the  human.  Of  the  incorrect  labels,  30% 
arose  from  situations  where  the  algorithm  did  not  find  a  match 
with  any  model  and  labeled  the  points  Unknown,  52%  came 
from  incorrectly  labeling  points  as  Obstacle  instead  of  Ground, 
and  17%  from  labeling  points  as  Ground  instead  of  Obstacle. 

TABLE I
Results for Test 6

Test 6, 2513 Ground Truth Points

No. Correct   No. Incorrect   % Correct   % Incorrect
2197          317             87.4%       12.6%

Error Distribution Across Label Types

Not Classified (Unknown)   Obstacle instead of Ground   Ground instead of Obstacle
30%                        52%                          17%

B.  Test  7 

The  course  for  Test  7  began  in  an  open  field.  The 
straight-line  path  would  put  the  vehicle  in  a  position  that 
required  a  long  detour  through  dense  bushes.  Traveling  to  the 
right  of  the  straight-line  path  led  to  an  easy  route  to  the  goal. 
The  Government  team  placed  an  artificial  barrier  in  the  path 
to  make  it  difficult  to  choose  the  right  hand  direction  the  first 
time  the  course  was  seen  (Fig.  5).  The  idea  was  that  the 
vehicle  would  fight  its  way  through  the  bushes  on  the  first  run 
before  reaching  the  goal,  but  would  learn  to  recognize  the 
barrier  and  select  the  right  hand  route  on  subsequent  test  runs. 
In  fact,  this  is  what  the  NIST  vehicle  did. 


Fig.  5.  A  view  of  the  Test  7  course  from  the  vehicle  (on  the 
wrong  side  of  the  barrier). 


The  ground  truth  for  Test  7  was  created  from  the  log  file  of 
the  first  test  run.  Two  different  people  generated  ground  truth 
files.  One  selected  1  point  per  frame,  resulting  in  a  usable 
count  of  702  points,  while  the  other  selected  3  points  per 
frame,  resulting  in  a  usable  count  of  2195  points,  where 
usable  points  are  determined  as  described  above  for  Test  6. 
Having different selections of points for the same data set
enabled  us  to  see  if  there  was  any  significant  variation 
between  people’s  selection  of  labels  and  also  let  us  see  if  a 
smaller  number  of  points  was  as  effective  as  a  larger  one. 

As  can  be  seen  in  TABLE  II  and  TABLE  III,  the  results  for 
both  the  small  sample  size  and  the  large  one  are  very  similar, 
indicating  that  it  is  not  necessary  to  label  large  numbers  of 
points.  What  was  surprising  was  that  the  distribution  of  the 
errors  was  different.  For  the  smaller  set,  the  percentage  of 
errors  due  to  the  learning  algorithm  not  being  able  to  identify 
the  class  of  the  point  was  46%,  whereas  the  corresponding 
percentage for the larger set was 71%. In the tests we have
done, the distributions of errors with different random sets of
points have not shown any obvious pattern.

TABLE II
Results for Test 7, User 1

Test 7, 702 Ground Truth Points

No. Correct   No. Incorrect   % Correct   % Incorrect
592           110             84.5%       15.5%

Error Distribution Across Label Types

Not Classified (Unknown)   Obstacle instead of Ground   Ground instead of Obstacle
47%                        34%                          19%

TABLE III
Results for Test 7, User 2

Test 7, 2195 Ground Truth Points

No. Correct   No. Incorrect   % Correct   % Incorrect
1884          312             85.8%       14.2%

Error Distribution Across Label Types

Not Classified (Unknown)   Obstacle instead of Ground   Ground instead of Obstacle
71%                        4%                           25%

C.  Test  9 


Test  9  was  conducted  in  the  desert  in  December,  2005.  The 
terrain  was  vegetated  with  both  woodland  and  grassland 
features.  The  vegetation  was  dry,  and  there  was  not  much 
color  difference  between  the  vegetation  and  the  ground  (Fig. 
6).  The  course  ran  along  a  mowed  path  through  the  terrain,  but 
there were other paths crossing the desired path which did not
provide  a  traversable  route  to  the  goal.  The  Government  test 
team  expected  the  vehicles  to  explore  the  side  paths  on  the 
first  run,  but  learn  that  they  were  not  productive  and  follow 
the  preferred  path  on  later  runs.  This  is  what  the  NIST  vehicle 
did. 


Fig.  6.  A  view  of  the  terrain  in  Test  9. 


The  ground  truth  for  Test  9  was  created  from  the  log 
file  of  the  first  run,  using  a  single  point  from  each  frame  and  a 
total  of  only  176  frames.  There  were  a  total  of  290  points  to 
be  classified.  As  can  be  seen  in  TABLE  IV,  the  system 
performed  a  little  worse  in  this  low-color  environment,  but 
still  respectably. 

TABLE IV
Results for Test 9

Test 9, 290 Ground Truth Points

No. Correct   No. Incorrect   % Correct   % Incorrect
232           58              80.3%       20.1%

Error Distribution Across Label Types

Not Classified (Unknown)   Obstacle instead of Ground   Ground instead of Obstacle
19%                        21%                          60%

D.  Cumulative  Results 

The  results  of  all  the  performance  evaluations  are 
accumulated  in  TABLE  V.  As  can  be  seen,  86%  of  the  time 
the  algorithm  assigns  similar  labels  to  regions  as  do  human 
observers. 


TABLE V
Cumulative Results

Tests 6, 7, and 9, 5701 Ground Truth Points

Number of points classified   5701
Number correct                4905
Number incorrect              797
Percentage correct            86%
Percentage incorrect          14%
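
As a worked check, the cumulative counts can be reproduced by summing the per-test counts in TABLES I to IV. The sketch below is illustrative only. It takes the number of incorrect points for Test 7, User 1 to be 110, i.e., 702 minus 592, and it uses the sum of the correct and incorrect counts (5702) as the denominator, which is one more than the 5701 shown in the table heading.

```python
# (correct, incorrect) per ground truth set, copied from TABLES I to IV.
# The 110 for Test 7, User 1 is 702 - 592 (assumption stated above).
RESULTS = {
    "Test 6": (2197, 317),
    "Test 7, User 1": (592, 110),
    "Test 7, User 2": (1884, 312),
    "Test 9": (232, 58),
}


def accumulate(results):
    """Sum the per-test counts and return (correct, incorrect, pct_correct)."""
    correct = sum(c for c, _ in results.values())
    incorrect = sum(i for _, i in results.values())
    total = correct + incorrect
    return correct, incorrect, 100.0 * correct / total


correct, incorrect, pct_correct = accumulate(RESULTS)
print(correct, incorrect, round(pct_correct))  # 4905 797 86
```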

IV  Evaluating  Algorithm  Parameters 

Another  way  of  using  the  ground  truth  data  is  to  investigate 
the  effects  of  the  model  parameters.  We  use  five  parameters, 
and  here  we  discuss  the  effects  of  selecting  subsets  of  these 
parameters.  We  explored  using  only  color  (no  intensity  or 
texture),  using  color  plus  intensity  with  no  texture,  and  not 
using  color.  There  are  two  color  components,  R/G  and  G/B. 
We  did  not  explore  removing  only  one  of  them.  Nor  did  we 
look  at  the  effects  of  contrast.  Some  of  the  results  were 
surprising. 

TABLE VI
Effects on Classification of Changing Model Parameters

Test 7 Model Parameter Variation

              No Texture   No Color   Only Color
% Correct     83.52%       53.26%     86.25%
% Incorrect   16.48%       46.79%     13.75%

Test 9 Model Parameter Variation

              No Texture   No Color   Only Color
% Correct     82.35%       76.12%     56.40%
% Incorrect   17.99%       24.22%     43.94%

TABLE  VI  shows  the  classification  success  of  the  algorithm 
when  it  learns  models  with  one  or  more  features  removed.  It 
appears  that  removing  texture  has  hardly  any  effect.  The 
percentage  of  correct  classifications  for  Test  7  goes  down 
marginally  (just  over  2%),  but  the  correct  classification  for 
Test  9  goes  up  (about  2%)!  This  is  very  surprising,  since  the 
data  for  Test  9  showed  little  color  variation,  so  we  assumed 
that  the  texture  was  providing  most  of  the  discrimination.  It 
probably  means  that  the  texture  measure  we  used  is  not 
suitable  for  this  application  (perhaps  because  it  uses  such  a 
small  neighborhood). 

On  the  other  hand,  taking  color  out  of  the  model  features  has 
a  big  impact,  dropping  the  classification  accuracy  in  Test  7 
from  about  86%  to  53%.  For  Test  9  the  accuracy  also  drops, 
but  only  from  80%  to  76%.  This  is  reasonable,  since  the  data 
showed  so  little  color  variation. 

Finally, if only color is used, the performance on Test 9
degrades  considerably,  from  80%  to  56%.  The  performance 
on  Test  7  actually  goes  up  marginally,  although  probably  not 
significantly.  We  can  conclude  that  intensity  plays  a 
significant  role  in  classification,  especially  in  Test  9.  Color  is 
clearly  important,  but  the  use  of  the  Local  Binary  Pattern 
operator  is  questionable. 

V.  Conclusion 

Knowing  the  traversability  of  terrain  is  very  important  to  a 
robot  that  navigates  off-road  like  those  in  the  DARPA  LAGR 
project.  At  NIST,  we  have  developed  several  methods  of 
learning  traversability  for  use  in  the  LAGR  program.  In  this 
paper,  we  discussed  our  method  of  evaluating  the  performance 
of  an  algorithm  that  learns  to  classify  terrain  as  either 
traversable  or  not  traversable  based  on  models  it  builds  using 
color  and  texture  features  of  the  terrain. 

The  performance  evaluation  is  not  specific  to  the  particular 
algorithm  shown  in  this  paper.  Once  a  human  has  generated  a 
set  of  ground  truth  points,  they  can  be  used  to  evaluate  any 
classification  algorithm.  It  is  straightforward  to  modify  the 
number  of  classes  the  user  has  available  to  classify  the  points, 
although  too  many  classes  may  lead  to  a  higher  rate  of  human 
error  in  classifying  the  points.  The  evaluation  was  also  applied 
to  the  stereo  obstacle  detection  algorithm  that  provides  the 
input  for  the  learning  algorithm  and  in  some  sense  determines 
the best performance that can be expected of it. The results
showed  that  the  obstacle  detection  algorithm  agreed  with 
human  classification  91%  of  the  time. 

The  random  nature  in  which  the  points  to  be  classified  are 
selected  has  the  advantage  of  preventing  any  bias  in  the  way 
that  the  image  sequence  is  sampled.  It  has  a  problem,  however, 
in  that  it  is  not  possible  to  say  anything  about  the  way  the 
errors  are  distributed  in  the  images.  There  is  a  significant 
difference  between  errors  that  congregate  at  the  boundaries  of 
regions  and  those  that  appear  throughout  the  image.  Usually, 
errors  close  to  boundaries  are  less  of  a  concern  since  they 
amount  to  a  disagreement  about  where  the  boundary  actually 
occurs.  Thus,  two  algorithms  with  the  same  performance  in 
terms  of  correct  classifications  could  differ  greatly  in  their 
utility.  The  method  used  in  this  paper  cannot  provide  a 
distinction  based  on  error  locations,  but  a  quick  scan  of 
images  such  as  Fig  2  gives  a  good  idea  of  the  error 
distribution. 

It  should  be  pointed  out  that  the  results  shown  in  this  paper 
do  not  take  into  account  some  postprocessing  that  is  done  in 
the  algorithm  after  an  image  frame  is  classified  but  before  the 
results  are  sent  to  the  planner.  This  involves  removing 
singleton  blocks  (16x16  windows  of  pixels)  classified  as  one 
type that lie within a region of the opposite type (e.g., a single
non-traversable  block  within  a  traversable  region  as  can  be 
seen  in  Fig  2).  Usually  such  blocks  are  the  result  of  incorrect 
classification  so  removing  them  improves  the  overall 
performance  of  the  algorithm.  In  one  of  the  tests  (Test  10), 
however,  the  vehicles  had  to  make  their  way  through  a  set  of 
thin posts randomly placed in a field. By removing singleton
blocks, the locations of some of the posts that had been
correctly  recognized  by  the  algorithm  as  not  traversable  were 
lost. 
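
The singleton removal step can be sketched on a grid of block labels, one entry per 16x16 window. The 8-neighbor rule, the treatment of image borders, and the label letters are assumptions made for this example; the paper does not specify them.

```python
OPPOSITE = {"T": "N", "N": "T"}  # T = traversable, N = not traversable


def remove_singletons(grid):
    """Flip any block whose in-bounds 8-neighbors all carry the opposite label.

    grid is a list of rows of 'T' or 'N' (other labels are left alone).
    Decisions are made on the original grid, so one pass cannot cascade,
    and the input is not modified.
    """
    rows, cols = len(grid), len(grid[0])
    result = [row[:] for row in grid]
    for r in range(rows):
        for c in range(cols):
            label = grid[r][c]
            if label not in OPPOSITE:
                continue
            neighbors = [grid[rr][cc]
                         for rr in range(max(0, r - 1), min(rows, r + 2))
                         for cc in range(max(0, c - 1), min(cols, c + 2))
                         if (rr, cc) != (r, c)]
            if neighbors and all(n == OPPOSITE[label] for n in neighbors):
                result[r][c] = OPPOSITE[label]
    return result
```

A thin post that occupies a single block surrounded by traversable ground is exactly the pattern this filter flips, which illustrates how the posts in Test 10 could be lost.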

It  is  very  helpful  to  be  able  to  use  the  performance 
evaluation  to  tune  the  algorithm  by  determining  the  useful 
features  and  their  relative  contributions  to  the  final 
classification.  Our  evaluation  showed  that  the  texture  operator 
was  not  performing  effectively  and  that  using  intensity  as  a 
feature  is  beneficial.  We  plan  to  explore  alternative  texture 
measures  based  on  multiresolution  Gabor  filters  as  in  [10]  to 
see  if  they  perform  better. 

Overall,  the  results  show  that  the  algorithm  for  learning 
traversability  works  well,  with  a  high  degree  of  agreement 
between  its  classifications  and  those  of  a  human  observer. 
This  provides  confidence  that  the  algorithm  will  enhance  the 
performance of the LAGR control system as a whole.